---
title: Reading a Multi-Line Text File into a Single Table Row
---

You can use the PXF HDFS connector to read one or more multi-line text files in HDFS each as a single table row. This may be useful when you want to read multiple files into the same Greenplum Database external table, for example when individual JSON files each contain a separate record.

PXF supports reading only text and JSON files in this manner.

**Note**: Refer to the [Reading JSON Data from HDFS](hdfs_json.html) topic if you want to use PXF to read JSON files that include more than one record.


## <a id="prereq"></a>Prerequisites

Ensure that you have met the PXF Hadoop [Prerequisites](access_hdfs.html#hadoop_prereq) before you attempt to read files from HDFS.


## <a id="fileasrow"></a>Reading Multi-Line Text and JSON Files

You can read single- and multi-line files into a single table row, including files with embedded linefeeds. If you are reading multiple JSON files, each file must be a complete record, and each file must contain the same record type.

PXF reads the complete file data into a single row and column. When you create the external table to read multiple files, you must ensure that all of the files that you want to read are of the same (text or JSON) type. You must also specify a single `text` or `json` column, depending upon the file type.

The following syntax creates a Greenplum Database readable external table that references one or more text or JSON files on HDFS:

``` sql
CREATE EXTERNAL TABLE <table_name>
    ( <column_name> text|json | LIKE <other_table> )
  LOCATION ('pxf://<path-to-files>?PROFILE=hdfs:text:multi[&SERVER=<server_name>][&IGNORE_MISSING_PATH=<boolean>]&FILE_AS_ROW=true')
FORMAT 'CSV');
```

The keywords and values used in the Greenplum Database [CREATE EXTERNAL TABLE](https://gpdb.docs.pivotal.io/latest/ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) command are described in the table below.

| Keyword  | Value |
|-------|-------------------------------------|
| \<path&#8209;to&#8209;files\>    | The path to the directory or files in the HDFS data store. When the `<server_name>` configuration includes a [`pxf.fs.basePath`](cfg_server.html#pxf-fs-basepath) property setting, PXF considers \<path&#8209;to&#8209;hdfs&#8209;files\> to be relative to the base path specified. Otherwise, PXF considers it to be an absolute path. \<path&#8209;to&#8209;files\> must not specify a relative path nor include the dollar sign (`$`) character. |
| PROFILE    | The `PROFILE` keyword must specify `hdfs:text:multi`. |
| SERVER=\<server_name\>    | The named server configuration that PXF uses to access the data. PXF uses the `default` server if not specified. |
| FILE\_AS\_ROW=true    | The required option that instructs PXF to read each file into a single table row. |
| IGNORE_MISSING_PATH=\<boolean\> | Specify the action to take when \<path-to-files\> is missing or invalid. The default value is `false`, PXF returns an error in this situation. When the value is `true`, PXF ignores missing path errors and returns an empty fragment. |
| FORMAT | The `FORMAT` must specify `'CSV'`.  |

**Note**: The `hdfs:text:multi` profile does not support additional custom or format options when you specify the `FILE_AS_ROW=true` option.

For example, if `/data/pxf_examples/jdir` identifies an HDFS directory that contains a number of JSON files, the following statement creates a Greenplum Database external table that references all of the files in that directory:

``` sql
CREATE EXTERNAL TABLE pxf_readjfiles(j1 json)
  LOCATION ('pxf://data/pxf_examples/jdir?PROFILE=hdfs:text:multi&FILE_AS_ROW=true')
FORMAT 'CSV';
```

When you query the `pxf_readjfiles` table with a `SELECT` statement, PXF returns the contents of each JSON file in `jdir/` as a separate row in the external table.

When you read JSON files, you can use the JSON functions provided in Greenplum Database to access individual data fields in the JSON record. For example, if the `pxf_readjfiles` external table above reads a JSON file that contains this JSON record:

``` json
{
  "root":[
    {
      "record_obj":{
        "created_at":"MonSep3004:04:53+00002013",
        "id_str":"384529256681725952",
        "user":{
          "id":31424214,
          "location":"COLUMBUS"
        },
        "coordinates":null
      }
    }
  ]
}
```

You can use the `json_array_elements()` function to extract specific JSON fields from the table row. For example, the following command displays the `user->id` field:

``` sql
SELECT json_array_elements(j1->'root')->'record_obj'->'user'->'id'
  AS userid FROM pxf_readjfiles;

  userid  
----------
 31424214
(1 rows)
```

Refer to [Working with JSON Data](https://gpdb.docs.pivotal.io/latest/admin_guide/query/topics/json-data.html) in the Greenplum Documentation for specific information on manipulating JSON data in Greenplum.


### <a id="example_fileasrow"></a>Example: Reading an HDFS Text File into a Single Table Row

Perform the following procedure to create 3 sample text files in an HDFS directory, and use the PXF `hdfs:text:multi` profile and the default PXF server to read all of these text files in a single external table query.


1. Create an HDFS directory for the text files. For example:

    ``` shell
    $ hdfs dfs -mkdir -p /data/pxf_examples/tdir
    ```

2. Create a text data file named `file1.txt`:

    ``` shell
    $ echo 'text file with only one line' > /tmp/file1.txt
    ```

3. Create a second text data file named `file2.txt`:

    ``` shell
    $ echo 'Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67' > /tmp/file2.txt
    ```

    This file has multiple lines.

4. Create a third text file named `/tmp/file3.txt`:

    ``` shell
    $ echo '"4627 Star Rd.
San Francisco, CA  94107":Sept:2017
"113 Moon St.
San Diego, CA  92093":Jan:2018
"51 Belt Ct.
Denver, CO  90123":Dec:2016
"93114 Radial Rd.
Chicago, IL  60605":Jul:2017
"7301 Brookview Ave.
Columbus, OH  43213":Dec:2018' > /tmp/file3.txt
    ```

    This file includes embedded line feeds.

5. Save the file and exit the editor.

6. Copy the text files to HDFS:

    ``` shell
    $ hdfs dfs -put /tmp/file1.txt /data/pxf_examples/tdir
    $ hdfs dfs -put /tmp/file2.txt /data/pxf_examples/tdir
    $ hdfs dfs -put /tmp/file3.txt /data/pxf_examples/tdir
    ```

7. Log in to a Greenplum Database system and start the `psql` subsystem.

8. Use the `hdfs:text:multi` profile to create an external table that references the `tdir` HDFS directory. For example:

    ``` sql
    CREATE EXTERNAL TABLE pxf_readfileasrow(c1 text)
      LOCATION ('pxf://data/pxf_examples/tdir?PROFILE=hdfs:text:multi&FILE_AS_ROW=true')
    FORMAT 'CSV';
    ```
    
9. Turn on expanded display and query the `pxf_readfileasrow` table:

    ``` sql
    postgres=# \x on
    postgres=# SELECT * FROM pxf_readfileasrow;
    ```

    ``` pre
    -[ RECORD 1 ]---------------------------
    c1 | Prague,Jan,101,4875.33
       | Rome,Mar,87,1557.39
       | Bangalore,May,317,8936.99
       | Beijing,Jul,411,11600.67
    -[ RECORD 2 ]---------------------------
    c1 | text file with only one line
    -[ RECORD 3 ]---------------------------
    c1 | "4627 Star Rd.
       | San Francisco, CA  94107":Sept:2017
       | "113 Moon St.
       | San Diego, CA  92093":Jan:2018
       | "51 Belt Ct.
       | Denver, CO  90123":Dec:2016
       | "93114 Radial Rd.
       | Chicago, IL  60605":Jul:2017
       | "7301 Brookview Ave.
       | Columbus, OH  43213":Dec:2018
    ```

